Abstract: Given a graph with positive edge weights and a 
positive integer m, the Constrained Forest Problem (CFP) 
seeks a minimum-weight spanning subgraph each of 
whose components contains at least m vertices. Such a subgraph 
is called an m-forest. We present a genetic algorithm (GA) for 
CFP, which is NP-hard for m ≥ 4. Our GA evolves good spanning 
trees, where the quality of a tree is determined by the weight of 
the m-forest into which it can be partitioned by a simple greedy algorithm. Genetic 
operators include mutation, which replaces a spanning tree 
edge by a different edge that connects the same pair of 
components, and recombination, which combines the edge sets 
of two spanning trees to produce two new spanning trees. The 
GA discovers m-forests that are significantly better than those 
identified by best-known approximation algorithms for CFP, 
and identifies optimal solutions in small problems. 

Index Terms — Genetic Algorithms, Spanning Forests, 
Constrained Forest Problem, Network Design Problems, 
Combinatorial Optimization. 

I. Introduction 

Given a graph with positive edge weights, the Constrained 
Forest Problem (CFP) seeks a minimum-weight 
spanning subgraph each of whose components contains at 
least m ≥ 1 vertices. CFP belongs to a class of network design 
problems that have applications in such domains as VLSI 
layout and telecommunications network design. Network 
design problems assume this form: Given an edge-weighted 
graph G=(V,E) and a demand function f: 2^V → {0, 1}, find a 
minimum-weight set of edges that, for all S ⊂ V, contains at 
least f(S) edges from δ(S), where δ(S) is the set of edges with 
exactly one endpoint in S. The CFP is modeled by the demand function 
f(S)=1 if and only if 0<|S|<m. A forest is feasible if and only if 
f(T)=0 for every tree T. 

Imielinska et al. [9] show that CFP is 
NP-hard for m ≥ 4 and present a greedy heuristic for CFP that 
produces solutions within a factor of two of the optimum. 
Laszlo and Mukherjee [12] present a second greedy heuristic 
(which we refer to as HEF, for heaviest edge first) that 
produces solutions at least as good as, and often better than, 
those produced by [9], and then locate both greedy heuristics 
within a family of heuristics all of which are 2-approximate for 
CFP [13]. Goemans and Williamson [6] treat CFP as a special 
case of network design problems with demand functions that 
are downwards monotone. They show their generalization of 
the greedy method of [9] to be 2-approximate. Goemans and 
Williamson [7] subsequently obtain a (2 − 1/|V|)-approximation 
method for CFP using a primal-dual approximation framework. 
Bar-Yehuda et al. [2] present a version of this method using 
the local ratio framework. Results on complexity and 
approximation of the CFP are presented in [3]. 

This paper 
presents a genetic algorithm (GA) to identify good solutions 
to CFP. The greedy HEF algorithm [12] achieves its 
2-approximate performance guarantee by operating on the 
edges of the graph's minimum spanning tree (MST). We show 
that there exist spanning trees that serve as better starting 
points; indeed, there exist spanning trees that yield 
optimal solutions under HEF. Our GA is designed to evolve 
such spanning trees. Each individual in a population encodes 
a spanning tree of the input graph whose fitness reflects the 
quality of the m-forest partition that results under greedy 
HEF. Genetic operators include mutation, which replaces a 
spanning tree edge by a different edge that crosses the same 
cut, and recombination, which combines the edge sets of 
two spanning trees to produce two new spanning trees. Other 
studies have employed similar chromosome representations 
for GAs targeting network design problems (for example, see 
[1], [11], [15], and [16]). Notably, our GA differs from other 
approaches in its choice of genotype-to-phenotype mapping 
and in its recombination operator. 

The paper is organized as 
follows. Section 2 motivates our use of the GA for this 
problem. Section 3 describes the GA. Section 4 presents 
examples on small data sets to illustrate our method. Section 
5 presents and discusses the results of experiments designed 
to evaluate the GA's efficacy. Section 6 concludes with some 
observations. 

II. Motivation 

A forest is called an m-forest if each of its trees contains 
at least m vertices for given m ≥ 1. The heaviest-edge-first 
(HEF) algorithm partitions a spanning tree T=(V,E) into an 
m-forest spanning vertex set V [12]. This greedy algorithm 
iteratively processes spanning tree edges in decreasing 
weight order, and removes an edge if and only if it connects 
two components of size at least m (see Figure 1). The input is 
the edge set E and integer m ≥ 1; the output is a subset of E 
representing an m-forest spanning vertex set V. We use HEF(T) 
to denote the m-forest produced by applying HEF to spanning 
tree T, where the value of m is assumed. 






ACEEE Int. J. on Information Technology, Vol. 01, No. 03, Dec 2011 



Input: edge set E of spanning tree T, and integer m > 0. 
Output: an m-forest F for T. 

heaviest_edge_first(E, m) { 
    F ← E; 
    while (E ≠ ∅) { 
        e ← some heaviest edge in E;  E ← E − {e}; 
        if (e connects two trees of size at least m in F) 
            F ← F − {e}; 
    } 
    return F; 
} 

Figure 1. Heaviest edge first (HEF) algorithm. 
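
The pseudocode of Figure 1 can be sketched in Python as follows. This is an illustrative sketch, not the authors' implementation: the function names `hef` and `has_at_least`, the `(weight, u, v)` edge format, and the small example path are assumptions made here. The size test stops as soon as m vertices have been found on a side, so each test explores only a bounded part of the tree.

```python
from collections import defaultdict


def has_at_least(adj, start, blocked, m):
    """True if at least m vertices are reachable from `start`
    without crossing the edge `blocked` (a set of two endpoints)."""
    seen = {start}
    stack = [start]
    while stack and len(seen) < m:
        u = stack.pop()
        for v in adj[u]:
            if v not in seen and {u, v} != blocked:
                seen.add(v)
                stack.append(v)
    return len(seen) >= m


def hef(tree_edges, m):
    """Apply heaviest-edge-first to a spanning tree.

    tree_edges: list of (weight, u, v) tuples forming a tree.
    Returns the edges that are kept, i.e. the m-forest HEF(T)."""
    adj = defaultdict(set)
    for _, u, v in tree_edges:
        adj[u].add(v)
        adj[v].add(u)
    kept = []
    # Process edges from heaviest to lightest; ties are broken arbitrarily.
    for w, u, v in sorted(tree_edges, key=lambda e: e[0], reverse=True):
        cut = {u, v}
        if has_at_least(adj, u, cut, m) and has_at_least(adj, v, cut, m):
            adj[u].discard(v)  # both sides are large: remove the edge
            adj[v].discard(u)
        else:
            kept.append((w, u, v))
    return kept


# Hypothetical example: a path on six vertices, partitioned with m = 2.
path = [(1, 0, 1), (5, 1, 2), (2, 2, 3), (4, 3, 4), (1, 4, 5)]
print(sorted(hef(path, m=2)))  # [(1, 0, 1), (1, 4, 5), (2, 2, 3)]
```

In the example, the edges of weight 5 and 4 are removed because each separates two components with at least two vertices, while the remaining edges are kept because removing any of them would isolate a single vertex.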

We can use the HEF algorithm in a 2-approximate greedy 
heuristic for CFP: First compute the MST of the input graph, 
and then apply HEF to the MST edges. Although the 
m-forests produced by this heuristic weigh no more than twice 
the optimum, in practice they may weigh close to this upper bound, 
and it is not difficult to construct examples that do so. This 
suggests that there may exist spanning trees other than the 
MST which yield m-forests of less weight under HEF. In the 
remainder of this section, we show that for any input graph 
G, there exists a spanning tree T of G such that the m-forest 
HEF(T) is optimal. This motivates the use of a genetic 
algorithm to find such HEF-optimal trees in the search space 
of all spanning trees of G. 

Let us say that a tree is small if it 
contains fewer than m vertices, and large otherwise. An edge 
e in tree T is removable if each of the two trees in the graph 
T − {e} is large. We refer to a tree or a forest as irreducible if 
it contains no removable edges (an irreducible forest 
comprises only irreducible trees). The forests 
F produced by HEF are irreducible. Suppose to the contrary 
that F contains some tree T that is not irreducible, and let 
edge e ∈ T be some removable edge that connects two large 
subtrees C_1 and C_2. When HEF processed edge e, e must 
have connected two subtrees C'_1 and C'_2 containing C_1 and 
C_2 as subgraphs, implying that both C'_1 and C'_2 are large. 
Consequently, edge e would have been removed by HEF, 
contradicting our choice of edge e. 

We refer to a tree as star-shaped 
if it contains some vertex v such that every subtree to 
which v is connected by an edge is small. Vertex v is called a 
center vertex of the tree. It is shown in [14] that a tree is 
irreducible if and only if it is star-shaped. We use this result 
to prove the following theorem: 

Theorem 1: Given graph G, 
there exists some spanning tree T of G such that applying 
HEF to T yields an optimal m-forest. 

Proof: Given an optimal 
m-forest F*, we construct a tree T such that HEF(T) yields 
F*. Suppose that F* contains the trees T_1, ..., T_k. Because 
edge weights are positive, F* contains no removable edge 
(deleting one would yield a lighter m-forest), so each tree T_i 
is irreducible and hence star-shaped. Let vertex v_i 
be some center vertex of tree T_i for i=1,...,k. First, add the 
edges of F* to T. Then, for i=2,...,k, add an edge connecting 
vertex v_i to v_1. To see that HEF(T)=F*, let us refer to the edges 
of F* as original edges, and the edges (v_i, v_1) that join two 
center vertices as bridge edges. Consider applying HEF to 
tree T, and let e be the current edge HEF is processing. There 
are two cases: 

• If e is an original edge, suppose e ∈ T_i. Edge e connects 
two components, one of which contains the center vertex v_i 
and the other of which, call it C, does not. Since tree T_i is 
irreducible, C is small, implying that edge e is retained by 
HEF. 
• If e is a bridge edge, it connects some tree T_i to the 
component C that contains tree T_1. T_i is large because it 
belongs to an m-forest (F*), and C is large because its 
subgraph T_1 is large; accordingly, edge e is removed by HEF. 

It follows that HEF(T)=F*. □ 
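
The construction in the proof can be checked on a small instance. The sketch below is illustrative only and rests on several assumptions: the graph is taken to be complete so that every bridge edge exists, each tree of the forest has at least one edge, `f_star` is a hypothetical irreducible 3-forest (not claimed to be optimal for any particular graph), and the helper names (`center`, `build_tree`, `side_size`) are invented here. It joins the center vertices by bridge edges and confirms that HEF returns the original forest.

```python
from collections import defaultdict


def adjacency(edges):
    adj = defaultdict(set)
    for _, u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def side_size(adj, start, avoid):
    """Vertices reachable from `start` without entering vertex `avoid`."""
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for v in adj[u]:
            if v != avoid and v not in seen:
                seen.add(v)
                stack.append(v)
    return len(seen)


def hef(edges, m):
    """Heaviest-edge-first: drop an edge iff both of its sides are large."""
    adj = adjacency(edges)
    kept = []
    for w, u, v in sorted(edges, key=lambda e: e[0], reverse=True):
        if side_size(adj, u, v) >= m and side_size(adj, v, u) >= m:
            adj[u].discard(v)
            adj[v].discard(u)
        else:
            kept.append((w, u, v))
    return kept


def center(tree_edges, m):
    """A vertex whose every branch has fewer than m vertices."""
    adj = adjacency(tree_edges)
    for c in list(adj):
        if all(side_size(adj, nb, c) < m for nb in adj[c]):
            return c
    raise ValueError("tree is not star-shaped")


def build_tree(forest, m, weight):
    """Join the trees of `forest` by bridge edges between center vertices."""
    centers = [center(t, m) for t in forest]
    tree = [e for t in forest for e in t]
    for c in centers[1:]:
        tree.append((weight(c, centers[0]), c, centers[0]))
    return tree


f_star = [[(1, 0, 1), (1, 1, 2)], [(2, 3, 4), (1, 4, 5)]]
t = build_tree(f_star, m=3, weight=lambda u, v: 10)
print(sorted(hef(t, 3)) == sorted(e for tree in f_star for e in tree))  # True
```

The outcome does not depend on the bridge weight: whether the bridge is processed first or last, it always separates two large components, while every original edge has a small side.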

A spanning tree T for which HEF(T) is optimal is referred 
to as HEF-optimal. The tree construction of the previous 
proof can be generalized to show that multiple HEF-optimal 
trees may exist for a given input graph. The genetic algorithm 
described in the following section seeks an HEF-optimal tree 
in the search space of trees that span the input graph. Each 
chromosome represents a spanning tree T, and its fitness is a 
function of the weight of the m-forest HEF(T). The genetic 
operators are designed to move the search from the initial 
population, made up of copies of the input graph's MST, 
toward the HEF-optimal trees in the search space. 

III. Genetic Algorithm 

Genetic algorithms find wide use in combinatorial 
optimization problems [8]. A GA evolves a population of 
chromosomes (or individuals) that represent feasible 
solutions. Each chromosome stores a set of genes whose 
values collectively determine the chromosome's fitness. 
During each generation, genetic operators — selection, 
recombination, mutation, and replacement — are applied to 
the current population to produce the next generation. 
Selection selects individuals from the current population with 
probability proportional to their fitness values. 
Recombination operates on two selected chromosomes, 
swapping portions of each to produce two new chromosomes. 
Mutation alters the value of a gene. Replacement merges the 
current population with its offspring and selects from the 
merged set the individuals that make up the next generation. 
Recombination focuses the search on promising 
neighborhoods of the feasible region while mutation widens 
the search to unexplored regions, thereby minimizing the risk 
of premature convergence to inferior solutions. The goal of a 
GA is to identify good solutions through this evolutionary 
process. We summarize the GA, then elaborate each phase, 
and then address time complexity. 

Given a graph G, create the initial population of μ 
chromosomes. 
For each of N generations { 

    Selection: Select μ parents based on fitness, with 
    replacement. 

    Recombination: Pair parents randomly and perform 
    recombination to produce offspring. 

    Mutation: Mutate offspring. 

    Replacement: Combine parents and offspring, and 
    form the next generation of size μ. 

} 
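
The outline above maps onto a generic loop such as the following sketch. The function `run_ga` and its callable parameters are hypothetical names introduced here, the population size is assumed to be even so that parents pair up exactly, and passing the mating buffer (rather than the whole current population) to replacement is one reading of the outline, not a confirmed detail.

```python
import random


def run_ga(initial, n_generations, select, recombine, mutate, replace, rng=None):
    """Generic generational loop; the four operators are supplied as callables."""
    rng = rng or random.Random()
    population = list(initial)
    mu = len(population)  # population size, assumed even
    for _ in range(n_generations):
        parents = select(population, mu, rng)      # mating buffer of size mu
        rng.shuffle(parents)                       # random pairing
        offspring = []
        for a, b in zip(parents[0::2], parents[1::2]):
            offspring.extend(recombine(a, b, rng)) # two children per pair
        offspring = [mutate(child, rng) for child in offspring]
        population = replace(parents, offspring, mu)
    return population


# Toy usage with integers standing in for chromosomes.
result = run_ga(
    initial=[0, 0, 0, 0],
    n_generations=5,
    select=lambda pop, k, rng: rng.choices(pop, k=k),
    recombine=lambda a, b, rng: (a, b),
    mutate=lambda x, rng: x + 1,
    replace=lambda parents, offspring, mu: sorted(parents + offspring, reverse=True)[:mu],
)
print(result)  # [5, 5, 5, 5]
```

Keeping the operators as parameters lets the same loop be run with and without recombination, or with different mutation schemes.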

Representation: Each chromosome represents a spanning 
tree of the input graph. The storage requirement for a 
chromosome is linear in the number of vertices in the graph. 

Initialization: The initial generation consists of μ copies of 
the MST. The MST is likely to contribute much of the genetic 
material (i.e., edges) comprising an optimal solution. We can 
rely on mutation to supply new genetic material not present 
in the MST. 

Fitness: The weight of the m-forest that results when HEF is 
applied to the set of edges represented by a chromosome is 
taken to be the chromosome's raw fitness. To obtain a scaled 
fitness value, we subtract the raw fitness from the total cost 
of the MST, and then linearly scale this value such that the 
maximum scaled fitness in the population is f times the average 
scaled fitness, where we use the value f=1.8 [8]. 
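
The scaling step can be illustrated with a standard linear scaling s = a·g + b that preserves the population average and maps the maximum to f times the average. The text does not spell out the exact formula, so the details below are assumptions: the function name `scaled_fitness`, the shift applied when a transformed value is negative, the fallback that maps the minimum to zero, and the uniform value returned when all individuals are equal.

```python
def scaled_fitness(raw, mst_cost, f=1.8):
    """raw: HEF forest weights (lower is better). Returns scaled fitness values."""
    g = [mst_cost - r for r in raw]      # larger g = lighter m-forest
    lo = min(g)
    if lo < 0:                           # assumption: shift so no value is negative
        g = [x - lo for x in g]
    avg, hi, lo = sum(g) / len(g), max(g), min(g)
    if hi == avg:                        # all individuals equal: select uniformly
        return [1.0] * len(g)
    a = (f - 1.0) * avg / (hi - avg)     # slope giving max = f * avg
    b = avg * (1.0 - a)                  # intercept preserving the average
    if a * lo + b < 0:                   # would go negative: map the minimum to 0
        a = avg / (avg - lo)
        b = -lo * a
    return [a * x + b for x in g]


s = scaled_fitness([90, 80, 70, 60], mst_cost=100)
print(max(s) / (sum(s) / len(s)))  # approximately 1.8
```

With raw weights 90, 80, 70, 60 and an MST cost of 100, the transformed values are 10, 20, 30, 40 with average 25, and the scaled maximum is 45 = 1.8 × 25.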

Selection: Using roulette-wheel sampling with replacement, 
we select μ chromosomes from the current population and 
place them in a mating buffer to serve as parents. The 
probability of a chromosome being selected in any trial is 
proportional to its scaled fitness value. 
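
Roulette-wheel sampling with replacement can be sketched with cumulative sums and binary search. The function name `roulette_select` and its signature are assumptions made for illustration; the fitness values are assumed to be nonnegative with a positive total.

```python
import bisect
import itertools
import random


def roulette_select(population, fitness, k, rng=None):
    """Draw k individuals with replacement, each with probability
    proportional to its (nonnegative) scaled fitness."""
    rng = rng or random.Random()
    cumulative = list(itertools.accumulate(fitness))
    total = cumulative[-1]
    last = len(population) - 1
    picks = []
    for _ in range(k):
        spin = rng.random() * total                     # point on the wheel
        index = bisect.bisect_right(cumulative, spin)   # slot containing it
        picks.append(population[min(index, last)])
    return picks


print(roulette_select(["a", "b", "c"], [0.0, 1.0, 0.0], 3))  # ['b', 'b', 'b']
```

Building the cumulative table costs time linear in the population size, and each of the k draws costs logarithmic time.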

Recombination: Given two spanning trees with edge sets E_1 
and E_2 that serve as parents, our recombination operator 
produces as offspring two spanning trees over the same set 
of edges E_1 ∪ E_2. Each offspring retains the set of edges E_1 ∩ E_2 
common to both parents while exchanging some of the edges 
belonging to the symmetric difference (E_1 − E_2) ∪ (E_2 − E_1). 
The recombination operator works as follows. First generate 
a uniform random number p between 0 and 0.5, and then 
create an edge set F containing a proportion p of the edges 
of E_1 − E_2 selected at random. For each edge e ∈ F, remove e 
from E_1, thereby separating E_1 into two trees. Assign the 
label A to one of the trees and the label B to the other (the 
labels represent a partition or cut of the vertex set into two 
nonempty classes). Next, add edge e to edge set E_2, and then 
traverse the unique cycle in E_2 to which e belongs in order to 
discover some edge f, distinct from e, that connects an A 
vertex to a B vertex. Lastly, remove edge f from edge set E_2 
and add it to E_1. By construction, E_1 and E_2 remain spanning 
trees, and edge f is drawn from the edge set E_2 − E_1. Each of 
the two offspring resulting from recombination represents a 
spanning tree and hence encodes a feasible solution. 

Mutation: With a small probability, each edge in a 
chromosome is removed, and the two components 
into which the spanning tree is thereby partitioned are reconnected 
using a randomly selected edge. We bias the mutation 
operator to increase the likelihood of obtaining spanning 
trees with two desirable properties. First, spanning trees 
should be of low degree; vertices with high degree result in 
large star-shaped trees that restrict the number of edges HEF 
can remove. We facilitate this in our mutation operator by 
removing edges connected to vertices of high degree with 
higher probability. Second, since the optimal m-forest is 
likely to contain mostly edges of low weight, we use a mutation 
operator that is biased toward selecting light edges to replace 
mutated edges, as follows: Randomly choose a fixed 
proportion of the vertices from each of the two components; 
call the subsets of selected vertices A and B. Select the lightest 
of the edges connecting A and B to bridge the components. 
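
The light-edge bias can be sketched as follows. The sketch assumes a complete graph (every pair of sampled vertices is joined by an edge), edges stored as tuples (u, v) with u < v, and a caller-supplied `weight(u, v)` function; the names `split` and `mutate_edge` and the default sampling fraction of 0.4 (the value used in the experiments of Section V) are choices made here for illustration.

```python
import math
import random
from collections import defaultdict


def split(tree, e):
    """Vertex sets of the two components of tree - {e}."""
    adj = defaultdict(set)
    for u, v in tree:
        if (u, v) != e:
            adj[u].add(v)
            adj[v].add(u)
    side, stack = {e[0]}, [e[0]]
    while stack:
        u = stack.pop()
        for v in adj[u]:
            if v not in side:
                side.add(v)
                stack.append(v)
    vertices = {x for edge in tree for x in edge}
    return side, vertices - side


def mutate_edge(tree, e, weight, fraction=0.4, rng=None):
    """Replace tree edge e by the lightest edge joining random samples
    of the two components that removing e creates."""
    rng = rng or random.Random()
    side_a, side_b = split(tree, e)
    a = rng.sample(sorted(side_a), max(1, math.ceil(fraction * len(side_a))))
    b = rng.sample(sorted(side_b), max(1, math.ceil(fraction * len(side_b))))
    u, v = min(((x, y) for x in a for y in b), key=lambda xy: weight(*xy))
    return (set(tree) - {e}) | {(min(u, v), max(u, v))}


tree = {(0, 1), (1, 2), (2, 3)}
child = mutate_edge(tree, (1, 2), weight=lambda u, v: abs(u - v))
print(len(child))  # 3
```

Because the replacement edge always joins the two components, the result is again a spanning tree. Note that this sketch may reselect the removed edge when both of its endpoints are sampled and it is the lightest candidate.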

Replacement: Replacement follows the elitist strategy. 
Specifically, the best μ/2 parents and best μ/2 offspring are 
merged to form the population for the next generation. 
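
A minimal sketch of this elitist replacement is shown below. The function name `elitist_replace` and the `raw_fitness` callable are assumptions; raw fitness is taken to be the weight of the m-forest, so lower values are better.

```python
def elitist_replace(parents, offspring, raw_fitness, mu):
    """Keep the best mu/2 parents and the best mu/2 offspring,
    where a lower raw fitness (lighter m-forest) is better."""
    half = mu // 2
    best_parents = sorted(parents, key=raw_fitness)[:half]
    best_offspring = sorted(offspring, key=raw_fitness)[:mu - half]
    return best_parents + best_offspring


# Toy usage: integers stand in for chromosomes, and the value is its own fitness.
print(elitist_replace([5, 3, 9, 1], [4, 8, 2, 7], lambda x: x, mu=4))  # [1, 3, 2, 4]
```

The two sorts account for the O(μ log μ) cost of replacement stated in the complexity discussion.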

Complexity: Assume that graph G has |E| edges and |V| 
vertices. The cost of initialization depends on the cost of MST 
construction. A single MST of G can be constructed in O(|E| + |V| 
log |V|) time using Prim's algorithm and Fibonacci heaps. 
Fitness calculation is dominated by the cost of applying HEF 
to a spanning tree, which takes O(m·|E|) time. Given 
population size μ, formation of the offspring population via 
selection takes O(μ log μ) time, and elitist replacement also takes 
O(μ log μ) time. Recombination of two spanning trees with 
edge sets E_1 and E_2 adds an edge e ∈ E_1 − E_2 to E_2, and then 
traverses the resulting cycle to discover an edge of E_2 − E_1 to 
replace e. Such an edge exchange may take O(|V|) time since the 
cycle may be as large as Θ(|V|). Moreover, since O(|V|) pairs 
of edges may be exchanged in this way, recombination takes 
O(|V|²) time in the worst case. Mutation replaces an 
edge e by another edge that crosses the same cut as e. If 
each of the two components that e connects is of size Θ(|V|), 
there may be as many as O(|V|²) edges to consider. However, 
because mutation considers only a fixed proportion of the 
candidate edges, the constant implicit in O(|V|²) can be made 
small. 

IV. Example 

We ran our GA on some small data sets to provide an 
illustration of the solutions it produces, and to compare the 
solutions to those obtained by the HEF method alone. 

Figure 2. Optimal 4-forests with n=13. 

Figure 2 shows optimal 4-forests obtained by our GA on two 
different complete 13-vertex graphs (panels a and b) using 
the Euclidean metric; it also presents the corresponding 4- 
forests produced by applying HEF to the graphs' MSTs. 
Edges removed from the spanning trees by HEF are shown 
as broken lines. Notice that the optimal solutions retain a 
subset of edges from the MSTs but also include non-MST 
edges. For n=13 and m=4 the number of feasible m-forests is 
48,764 (see the appendix of [14] for a method to compute the 
number of feasible m-forests); the number of distinct spanning 
trees is over 1.79E+12. With an initial population of 20 MSTs, 
our GA found the optimal solution within 40 generations (thus 
exploring at most 800 spanning trees) in each of 20 consecutive 
runs. Each run took less than 2 seconds as Java bytecode on a 
2.5 GHz Pentium processor. 
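
The search-space figures quoted above can be reproduced with a few lines of arithmetic. By Cayley's formula, a complete graph on n labeled vertices has n^(n−2) spanning trees; the variable names below are chosen here for illustration.

```python
n, population, generations = 13, 20, 40

spanning_trees = n ** (n - 2)          # Cayley's formula for the complete graph K_n
evaluated = population * generations   # upper bound on trees examined in one run

print(spanning_trees)                  # 1792160394037
print(evaluated)                       # 800
print(evaluated / spanning_trees)      # fraction of the search space examined
```

The GA therefore examines at most 800 of roughly 1.79 × 10^12 spanning trees in each run.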

V. Results 

This section summarizes the results of our investigations 
for larger data sets. We used the TSP-1060 benchmark dataset 
taken from [18] to investigate the effectiveness of our 
recombination and mutation operators. The dataset contains 
1060 two-dimensional points. Each point represents a vertex 
in a complete graph where edge weight is the Euclidean 
distance between pairs of vertices. 

A. Recombination operator 

Figure 3 presents the best solutions found by the GA 
over 800 generations in 6 different runs. The thin lines present 
the results of three runs where the recombination operator 
was applied with probability 0.8; the thick lines present results 
of runs without recombination. All other GA parameters are 
identical for both sets of runs. Notably, recombination 
consistently yields better quality solutions; over 30 GA runs, 
the mean decrease in weight in the best m-forests identified 
was 2.16 times greater when recombination was used. 







Figure 3. Best solutions found with recombination. 

B. Biased mutation to reduce high degree vertices 

We hypothesize that better quality solutions result when 
edges connected to vertices of high degree are selectively 
removed, since high-degree vertices result in large star-shaped 
trees and restrict the number of edges that HEF can remove. 
Results of our computational experiments to investigate this 
hypothesis are presented in Figure 4. We define edge degree 
to be the maximum of the degrees of an edge's two endpoints. 
We bias mutation to favor removal of edges of high degree. 
Specifically, edges of degree two are mutated with probability 
p_low, edges of highest degree with probability p_high, and all 
remaining edges with probability p_med. In Figure 4, the thin 
line represents the best m-forest found over successive 
generations for the parameters p_low=0.001, p_med=0.002, and 
p_high=0.004. For comparison, the thick lines present results 
for two settings in which the mutation probability is held 
constant, p_low=p_med=p_high. 
Results on 30 runs of the GA were consistent 
with our hypothesis. The mean decrease in weight in the best 
m-forests identified was 2.12 times greater when the mutation 
was biased so as to reduce high-degree vertices. 
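
The degree-based bias can be sketched as a mapping from each tree edge to its mutation probability. The function name `mutation_probabilities` is hypothetical, and the tie-breaking rule (an edge of degree two or less always receives p_low, even when two is also the highest edge degree in the tree) is an assumption, since the text does not address that case.

```python
from collections import Counter


def mutation_probabilities(tree, p_low=0.001, p_med=0.002, p_high=0.004):
    """Map each edge (u, v) of a spanning tree to its mutation probability."""
    degree = Counter(x for edge in tree for x in edge)
    # Edge degree: the larger of the degrees of the two endpoints.
    edge_degree = {e: max(degree[e[0]], degree[e[1]]) for e in tree}
    highest = max(edge_degree.values())
    probs = {}
    for e, d in edge_degree.items():
        if d <= 2:
            probs[e] = p_low      # low-degree edges are rarely mutated
        elif d == highest:
            probs[e] = p_high     # edges at the highest-degree vertices
        else:
            probs[e] = p_med
    return probs


tree = [(0, 1), (0, 2), (0, 3), (0, 4), (4, 5), (4, 6), (6, 7)]
probs = mutation_probabilities(tree)
print(probs[(0, 1)], probs[(4, 5)], probs[(6, 7)])  # 0.004 0.002 0.001
```

In the example, vertex 0 has degree four, so its four incident edges receive p_high; the two other edges at vertex 4 (degree three) receive p_med; and the edge (6, 7) has degree two and receives p_low.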

C. Biased mutation to introduce low-weight edges 

We also conjecture that the optimal m-forest is likely to 
contain mostly edges of low weight, and we bias our mutation 
operator to replace mutated edges by edges of low weight. 
Figure 5 demonstrates the effect of selectively adding edges 
of low weight during mutation. The thin lines present the best 
solutions found over 800 generations on 3 runs of our GA 
under the following biased mutation scheme: When an edge is 
removed, it is replaced by the lightest edge bridging randomly 
selected 40% samples of the vertices in the two resulting 
components. The thick lines present the best solutions 
found when an arbitrary edge spanning the two components 
is selected. Notice that a bias towards introducing lighter edges 
results in better quality solutions. 

Figure 4. Biased mutation replaces edges connected to vertices of 
high degree. 




Figure 5. Biased mutation introduces low weight edges. 

VI. Conclusions 

Our paper presents a genetic algorithm for finding good 
solutions for the NP-hard CFP. The GA exploits an 
existing heuristic for CFP in the mapping from genotype 
(spanning trees) to phenotype (m-forests), while 
outperforming the best-known heuristics when applied to 
large problems. When applied to small problems, the GA 
rapidly converges to an optimal solution. The representation 
of individuals and the choice of genetic operators that act on 
that representation are critical decisions in the design of a 
GA. We have chosen to represent individuals as edge sets 
that comprise spanning trees. This representation maps to 
an m-forest under the greedy HEF heuristic. Our primary 
genetic operators are recombination and mutation. Our 
recombination operator is a novel method for combining two 
individuals to form two new offspring composed of the same 
set of edges as their parents; the resulting offspring are 
feasible yet different from their parents. Mutation, which we 
explore in both biased and unbiased variants, transforms an 
individual into another by exchanging edges while ensuring 
the feasibility of the new individual. The representation of 
spanning trees as edge sets has been used to solve other 
intractable problems such as the degree-constrained minimum 
spanning tree [16] and minimum routing cost spanning tree 
[10] problems. Our GA's use of edge sets for CFP differs from 
prior use in two respects. First, because the genotype 
(spanning trees) and phenotype (m-forests) are of different 
type, the genotype-to-phenotype mapping is nontrivial. 
Second, the GA employs an innovative recombination operator 
on edge sets. These two novelties suggest that GAs of similar 
design might be useful for other optimization problems 
involving forests and other graph structures derivable from 
spanning trees. 